This code is a demonstration of applying ARIMA and SARIMA to a time series data

1 Loading necessary libraries

options(warn=-1)
library(astsa) #provides time series model functions (arima, sarima)
library(TSstudio) #plotly library for plotting time series data
library(xts) #library to convert data frame to time series object
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

1.0.1 Load the daily female births in California data set

birth.data=read.csv("daily-total-female-births-CA.csv")
head(birth.data)
str(birth.data)
## 'data.frame':    365 obs. of  2 variables:
##  $ date  : chr  "1959-01-01" "1959-01-02" "1959-01-03" "1959-01-04" ...
##  $ births: int  35 32 30 31 44 29 45 43 38 27 ...

1.0.2 Converting dates in string format to datetime object

birth.data$date<-as.Date(birth.data$date)
head(birth.data)
#creating a separate variable for just the number of births for further calculations
number_of_births<-birth.data$births

1.0.3 Converting the data frame object to time series object for plotting

#converting data frame to time series object
birth.ts <- xts(birth.data[,-1], order.by=birth.data[,1])
#rename second column to appropriate name
names(birth.ts)[1] <- "births"
head(birth.ts)
##            births
## 1959-01-01     35
## 1959-01-02     32
## 1959-01-03     30
## 1959-01-04     31
## 1959-01-05     44
## 1959-01-06     29

2 Plotting the time series data

ts_plot(birth.ts, title = "Daily total female births in California in 1959",
        Xtitle = "Month",
        Ytitle = "Number of female births",
        slider=TRUE)

There is definitely some trend in the time series

3 Ljung-Box test to check for correlations between lags

Box.test(x=birth.ts,lag=log(length(birth.ts)))
## 
##  Box-Pierce test
## 
## data:  birth.ts
## X-squared = 36.391, df = 5.8999, p-value = 2.088e-06

Since the p-value is less than 0.05, we reject the Null Hypothesis and conclude that there is significant autocorrelation between the lags in the time series data

4 Remove trend

In order to remove the trend, we use the difference operator on the number of births. Specifically, a new series is constructed where the value at the current time step is calculated as the difference between the original observation and the observation at the previous time step.

value(t) = observation(t) - observation(t-1)

ts_plot(diff(birth.ts),title='Differenced data')
#hence, the trend is removed with outliers at Sept 2 and 3

Again checking the p-values for the above differenced data.

Box.test(diff(birth.ts),lag=log(length(birth.ts)))
## 
##  Box-Pierce test
## 
## data:  diff(birth.ts)
## X-squared = 78.094, df = 5.8999, p-value = 7.661e-15
#p-value is again very less than 0.05. therefore, autocorrelation is still there and cannot be further eliminated. 

5 Building the ARIMA model: Deciding parameters for the model through ACF and PACF

par(mfrow=c(2,1))
acf(diff(number_of_births))
pacf(diff(number_of_births))

Hence, we can make out that:

  1. ACF- there is one significant spike
  2. PACF- there are significant spikes up until lag 7

We create 4 models with: p=0,7, d=1(since we have differenced the Time Series once) and q=1,2

model1<-arima(birth.ts,order=c(0,1,1))
model1.test<-Box.test(model1$residuals,lag=log(length(model1$residuals)))

model2<-arima(birth.ts,order=c(7,1,1))
model2.test<-Box.test(model2$residuals,lag=log(length(model2$residuals)))

model3<- arima(birth.ts,order=c(0,1,2))
model3.test<-Box.test(model3$residuals,lag=log(length(model3$residuals)))

model4<-arima(birth.ts,order=c(7,1,2))
model4.test<-Box.test(model4$residuals,lag=log(length(model4$residuals)))


#collecting all data into a single data frame
df<-data.frame(row.names=c("AIC","p-value"),c(model1$aic,model1.test$p.value),
               c(model2$aic,model2.test$p.value),
               c(model3$aic,model3.test$p.value),
               c(model4$aic,model4.test$p.value))
colnames(df)<-c('Arima(0,1,1)','Arima(7,1,1)', 'Arima(0,1,2)', 'Arima(7,1,2)')
df

6 Select best model

We select the model with lowest AIC i.e. ARIMA(0,1,2)

7 Fitting the SARIMA Model: Does this perform better than the ARIMA model?

sarima(birth.ts, 0,1,2,0,0,0)
## initial  value 2.216721 
## iter   2 value 2.047518
## iter   3 value 1.974780
## iter   4 value 1.966955
## iter   5 value 1.958906
## iter   6 value 1.952299
## iter   7 value 1.951439
## iter   8 value 1.950801
## iter   9 value 1.950797
## iter  10 value 1.950650
## iter  11 value 1.950646
## iter  12 value 1.950638
## iter  13 value 1.950635
## iter  13 value 1.950635
## iter  13 value 1.950635
## final  value 1.950635 
## converged
## initial  value 1.950708 
## iter   2 value 1.950564
## iter   3 value 1.950290
## iter   4 value 1.950196
## iter   5 value 1.950185
## iter   6 value 1.950185
## iter   7 value 1.950185
## iter   7 value 1.950185
## iter   7 value 1.950185
## final  value 1.950185 
## converged

## $fit
## 
## Call:
## arima(x = xdata, order = c(p, d, q), seasonal = list(order = c(P, D, Q), period = S), 
##     xreg = constant, transform.pars = trans, fixed = fixed, optim.control = list(trace = trc, 
##         REPORT = 1, reltol = tol))
## 
## Coefficients:
##           ma1      ma2  constant
##       -0.8511  -0.1113     0.015
## s.e.   0.0496   0.0502     0.015
## 
## sigma^2 estimated as 49.08:  log likelihood = -1226.36,  aic = 2460.72
## 
## $degrees_of_freedom
## [1] 361
## 
## $ttable
##          Estimate     SE  t.value p.value
## ma1       -0.8511 0.0496 -17.1448  0.0000
## ma2       -0.1113 0.0502  -2.2164  0.0273
## constant   0.0150 0.0150   1.0007  0.3176
## 
## $AIC
## [1] 6.760225
## 
## $AICc
## [1] 6.760408
## 
## $BIC
## [1] 6.803051

Looking at the p-values of Moving Averages, there is no autocorrelation between the lags. The AIC of the SARIMA model is somewhat similar to that of ARIMA model. The Q-Q plot is almost linear.